Systems and methods for tumor clonality analysis

ABSTRACT

Systems and methods of genomic analysis are presented that provide a framework to determine a tumor&#39;s clonality, the number and proportion of all major clones, and the variants that distinguish them. Contemplated systems and methods also allow phasing mutations to parental alleles to so time their emergence within the population of tumor cells, and provide an accurate estimate of the amount of contaminating normal tissue that was present in the tumor biopsy.

This application claims priority to our U.S. provisional applicationwith the Ser. No. 61/711,467, which was filed Oct. 9, 2012.

FIELD OF THE INVENTION

The field of the invention is computational analysis of genomic data,particularly as it relates to identification of clonality status of amixed cell population.

BACKGROUND OF THE INVENTION

With the increasing availability of whole genome data and theever-increasing speed of whole genome sequencing, enormous quantities ofdata are now available that demand a meaningful analysis to so provide aclinician or scientist with information to enable more effectivetreatment or drug development.

For example, multiple tumor and matched normal whole genome sequencesare now available from projects like ‘The Cancer Genome Atlas’ (TCGA),and extraction of relevant information is difficult. This is furthercompounded by the need for high genome sequencing coverage (e.g.,greater than 30-fold) to obtain statistically relevant data. Even incompressed form, such genomic information can often reach hundreds ofgigabytes, and a meaningful analysis comparing multiple of such largedatasets is in many cases very slow and difficult to manage, however,absolutely necessary to discover the many genomic changes that occurredin any given sample relative to a second sample. More recently, systemsand methods have been developed to allow for rapid generation ofinformation in a format that avoids massive output files as is describedin WO2013/074058. This publication and all other publications identifiedherein are incorporated by reference to the same extent as if eachindividual publication or patent application were specifically andindividually indicated to be incorporated by reference. Where adefinition or use of a term in an incorporated reference is inconsistentor contrary to the definition of that term provided herein, thedefinition of that term provided herein applies and the definition ofthat term in the reference does not apply.

While the system of the '058 application provides a significantimprovement over other known systems, various difficulties neverthelessare present. For example, most breast cancer is clinically andgnomically heterogeneous and is composed of several pathologically andmolecularly distinct subtypes, which often complicates genomic analysis.Moreover, currently known methods do not allow for deconvolution of suchgenomic diversity to so gain insight into possible tumor cell evolutionand resulting clonality among tumor cells in a tissue.

Thus, even though numerous methods of genomic analysis are known in theart, all or almost all of them suffer from several disadvantages. Mostsignificantly, heretofore known methods fail to allow identification oftumor progression on a molecular level, and with that fail to provideinsight into clonality and potential treatment efficacies. Viewed fromanother perspective, heretofore known methods failed to allowidentification of clonality and clonal relationship of cell populationswithin a sample containing multiple non-homogenous cells. Consequently,there is still a need to provide improved systems and methods forgenomic analysis, and especially systems and methods that provideinformation on clonality, clonal fraction, molecular tumor progression,and/or treatment options based on such information.

SUMMARY OF THE INVENTION

The present invention is directed to various systems, devices, andmethods for genetic analysis, and especially genomic analysis toidentify presence and distribution of distinct cell clones within asample containing one or more clonal populations of cells based ongenomic data obtained from the sample. In especially preferred aspects,analysis is based on genomic DNA from a tumor or otherwise abnormal cellpopulation, and allows not only determination of multiple clones withinthe tumor or cell population but also allows identification of likelyclonal evolution and/or clonal relationships.

In one aspect of the inventive subject matter, a method of ex-vivodetermining clonality of a tumor from sequencing data obtained from thetumor includes a step of determining from the sequencing data copynumber and allele fraction for an allele within the sequencing data, andanother step of calculating an allelic state for the allele based on thedetermined copy number and the determined allele fraction. The allelicstate is then used to determine the clonality of the tumor. While notlimiting to the inventive subject matter, it is generally preferred thatthe allelic state is plotted or displayed in an allelic state diagram(which may be a single or dual allelic state diagram).

In at least some embodiment of the inventive subject matterdetermination of the copy number and allele fraction is performed by asequence analysis program that produces local alignments by incrementalsynchronization of sequence strings (e.g., BAMBAM). Among other states,contemplated allelic states include normal copy number, single copyamplification, single copy/hemizygous deletion, copy-neutral loss ofheterozygosity, and amplification of both alleles.

In further contemplated embodiments of the inventive subject matter, theallelic state calculation comprises a correction for normalcontamination, uses majority and minority allelic states for tumor andnormal, and/or includes an identification of a mixture fraction Mb foran allele (which is either 0 or 1 for a monoclonal tumor, or greaterthan 0 and smaller than 1 when the tumor is polyclonal). It is stillfurther contemplated that the calculation of the allelic state may alsocomprises a correction for sequencing coverage level, particularly wherethe coverage level for the tumor is higher than the coverage level for acorresponding non-tumor (e.g., healthy) sample of the same patient.

Where desired, contemplated methods will further include a step ofdetermination of an allelic state landmark, which is preferably used todetermine a number of distinct (related or unrelated) clones in thetumor and/or a proportion of clones in the tumor. Additionally, oralternatively, it is still further contemplated that a mutation can belinked to a majority allele or a minority allele, and that themutational allele fraction can be employed for determination of timingof the mutation relative to a change in allelic state.

In another aspect of the inventive subject matter, a method of ex-vivovisualization of allelic states in a tumor includes a step ofdetermining a copy number and an allele fraction for an allele withinsequencing data, and a step of calculating the allelic state for theallele based on the determined copy number and the determined allelefraction. In a still further step, the allelic state of the allele ismapped in an allelic state diagram that plots copy number versus allelefraction (typically majority allele fraction).

Most typically, the allelic state diagram is presented such that eachvertex in the allelic state diagram corresponds to a tumor allelicstate, that clones with loss or gain of an allele in a polyclonal tumormap along edges drawn between vertices, and/or that clones with changesother than loss or gain of an allele in a polyclonal tumor map betweenedges drawn between vertices. It is still further contemplated that theallelic state diagram is adjusted for normal contamination. Of course,it should be appreciated that the allelic state diagram may be a dualallele state diagram.

Therefore, and viewed from a different perspective, the inventors alsocontemplate a method of analyzing genomic sequence data in which a BAMserver receives a plurality of genomic sequence reads, wherein theplurality of genomic sequence reads are obtained from a genome of atumor sample and a genome of a normal sample of a patient. The BAMserver then processes the genomic sequence reads to produce a pluralityof differential sequence objects that comprise a copy number and anallele fraction for an allele within the tumor genome. An analyticsengine (that is coupled to the BAM server) then processes the copynumber and the allele fraction for the allele to so determine an allelicstate for the allele.

In a typical embodiment of such methods, a differential sequencedatabase is coupled to the BAM server and the analytics engine such thatthe BAM server provides the differential sequence object to thedifferential sequence database and such that the differential sequencedatabase provides the differential sequence object to the analyticsengine. Furthermore, it is contemplated that a graphic output isgenerated by the analysis engine that plots the allelic state for theallele in an allelic state diagram.

In a further contemplated aspect of the inventive subject matter, amethod of ex-vivo characterizing genomic information from a tumorincludes a step of determining an allelic state for an allele in thetumor genome, and a further step of using the determined allelic stateto identify the tumor as being a monoclonal tumor or as comprising atleast two distinct tumor clones.

In such methods, it is further contemplated to use the determinedallelic state to identify a relationship of the tumor clones (e.g., asbeing distinct and unrelated or as being related). Where the clones arelated, it is contemplated that the determined allelic state can beemployed to identify a clonal history for the distinct tumor clones.

Thus, the inventors also contemplate a method of ex-vivo characterizinga tumor clone in a tumor mass in which in one step genomic sequenceinformation from the tumor mass is obtained (e.g., from a BAM server).In another step, the genomic information is used to determine an allelicstate for an allele in the tumor genomic sequence information. In afurther step, the location of the allelic state for the allele in anallele state diagram is determined (e.g., in a graphic display or insilico, or numerically), and the location is used to identify the cloneas monoclonal or polyclonal. For example, a clone is monoclonal when thelocation of the allelic state is on a vertex of the allelic statediagram.

In yet another aspect of the inventive subject matter, the inventorscontemplated a method of providing treatment information for treatmentof a tumor. In such method, allelic state information for the tumor isascertained, and presence or emergence of (a) a clone or (b) anevolutionary pattern of clones is ascertained within the tumor that isindicative of at least one of susceptibility of the tumor to treatmentwith a drug, and an increased risk of drug resistance or metastaticpotential. Most typically, the step of identifying presence or emergenceis based on prior treatment data or a priori known data.

Various objects, features, aspects and advantages of the inventivesubject matter will become more apparent from the following detaileddescription of preferred embodiments, along with the accompanyingdrawing figures in which like numerals represent like components.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is an exemplary illustration of evolution of a tumor startingfrom a germline cell, to an initial tumor cell, to a population of majorand minor clones that are sampled by the tumor biopsy.

FIG. 2 is an exemplary allelic state diagram (ASD) of simulated data fora monoclonal tumor sample with zero normal contamination, α=0.Chromosomal regions exhibiting different copy number alterations areplotted in different shades. This simulated tumor genome exhibits 6allelic states: normal, single-copy amplification, hemizygous deletion,homozygous deletion, copy-neutral loss of heterozygosity (LOH), andmulti-copy biallelic amplification. (supported by [0009]).

FIGS. 3A-3D depict an exemplary set of allelic state diagrams for thesimulated monoclonal tumor genome of FIG. 2 with different levels ofnormal contaminant: FIG. 3A illustrates 0% normal contaminant, FIG. 3Billustrates 10% normal contaminant, FIG. 3C illustrates 50% normalcontaminant, and FIG. 3D illustrates 90% normal contaminant, indicatingthe difference is resolution as a function of level of normalcontamination.

FIG. 4 is the allelic state diagram of FIG. 2 showing some possiblebidirectional and unidirectional transitions between allelic stateswhere unidirectional transitions are those that involve irreversibleloss of a parental chromosome.

FIG. 5 is an allele state diagram for a tumor genome transitioning fromthe allelic states presented in the previous figures to new allelicstates that differ only by a single copy loss or gain. Here,transitional allelic states are created when the tumor comprises amixture of two different clones/subclones: Clone A is defined by theoriginal allelic states: (2,1), (5,2), and (1,0). Clone B alters thesestates through amplifications and a deletion to produce the allelicstates: (2,2), (4,2), and (2,0). The percentages denote the percentageof clone B present in the tumor population, where 0% describes amonoclonal population of clone A, and 100% is a monoclonal population ofclone B.

FIG. 6 is an exemplary illustration of an allelic state diagram of theFIG. 2 showing the transitional allelic states produced when allelicstates are “skipped”, which can occur when the tumor consists of two ormore unrelated, or distantly-related clones. In that case, thetransitional states are not found on the edges connecting allelic statesif the allelic states of the two major clones differ in both majorityand minority alleles.

FIGS. 7A and 7B are exemplary illustrations of allelic state diagramsfor two glioblastoma multiforme (GBM) tumors: GBM-06-0145 (FIG. 7A) andGBM-06-0185 (FIG. 7B). The fitted parameters found normal contaminationat 21.5% and 14.6%, respectively. FIG. 7A depicts only clonal allelicstates and no evidence of transitional allelic states, indicating thatGBM-06-0145 is a monoclonal tumor, while FIG. 7B depicts both clonalstates and multiple transitional allelic states. Since the transitionalallelic states (marked with (*)) feature three different mixturepercentages, this polyclonal tumor must consist of at least threesub-clones. The black X's plotted in FIG. 7B represent “landmark”allelic states suitable for use to determine the clonal mixture ofGBM-06-0185.

FIGS. 7C and 7D are exemplary illustrations of allelic state diagramsfor two lung squamous cell carcinoma (LUSC) tumor: LUSC-34-2596 (FIG.7C)and LUSC-66-2756 (FIG. 7D).

FIG. 8 is an exemplary plot depicting the monoclonal karyotype forGBM-06-0145. The “Relative Coverage” and “Allele Fraction” displayed atthe top of the plot shows both the observed results output by BamBam andthe computed coverage and allele fraction generated by modeling themixture of the single clone and normal contamination. The comparison ofreal versus modeled data shows very strong agreement. The clone'skaryotype below shows the majority and minority allelic states for thetumor genome, showing amplification of one copy of entire chr7 andchr19, complete loss of one copy of chr10, and arm-level loss of chr9p.

FIG. 9 is an exemplary plot depicting the polyclonal karyotypes forGBM-06-0185. A total of 4 distinct clones were identified in this tumor,with clone D determined to be the dominant clone of the population,comprising 42.7% of the tumor sample. All clones have single-copyamplifications of chr7, chr19 & chr20, single copy loss of chr10 andchr22, and loss of chr9p in common. Clones B, C, & D all have deletionsin chr6, but clone B's deletions are focal while clones C & D displayarm-level loss of chr6q. Clone D is further distinguished byamplification of the intact copy of chr9.

FIG. 10 is an exemplary plot depicting the polyclonal karyotypes forGBM-06-0152. Three clones were identified in this tumor sample that hasan estimated 24.1% normal contamination. All clones share amplificationsof chr1, chr19 & chr20, deletions of chr10 & chr22, and focal losses ofchr12 related to the chromothripsis-like event that created two DMsdescribed in the previous chapter. Clones B & C exhibit amplification ofchr7 and deletions of the nonamplified copy of chr1 as well as chr2,chr3, chr4, chr8, chr13, and chr17. Clone C further amplifies theremaining copy of chr8.

FIG. 11 is an exemplary plot depicting the polyclonal karyotypes forGBM-06-1086. Four clones were identified in this tumor sample that hasan estimated 7.5% normal contamination. All clones share amplificationof chr21 and deletions of chr9 & chr11p. Clones C & D exhibitsignificant chromosomal loss, deleting chr1, chr3, chr4, chr5, chr6,chr8, chr10, chr13, chr14, chr15, chr17, chr18, and chr20. The dominantclone D, making up 41.6% of the tumor sample, further deletes the soleremaining copy of chr18 and amplifies chr19. The black arrows indicatethe position of CDK2NA in clones A & B, highlighting the arrival of thefocal deletion of CDKN2A in the latter clone.

FIG. 12 is an exemplary illustration of phased mutations on the dualASD. A representative region on the tumor genome is shown, consisting ofa region in the single copy gain allelic state, a region in the “normal”allelic state, a region exhibiting copy-neutral loss of heterozygosity(CN-LOH), and a region exhibiting LOH. Three mutations are found in theamplified region, two majority-phased (red stars) and oneminority-phased (blue star). Two mutations are found on the “normal”allelic state, one majority-phased and another minority-phased. Bothregions exhibiting LOH have one mutation each phased to the soleremaining allele, and are thus majority-phased. The dual ASD below showswhere each of these mutations would be found, using each mutation'scorrected allele fraction, MAFc, to determine its location along thex-axis. Note the different placement of the two majority-phasedmutations in the single-copy gain allelic state, where only the mutantallele that exists on both majority alleles (i.e. mutated beforeamplification) is found near the single copy gain majority allelicstate. The other is found near the single copy gain minority allelicstate, correctly identifying that the mutation exist on only one copy ofthe majority allele. Finally, note that the minority-phased mutations inblue are all found towards the left-half of the dual ASD.

FIG. 13 is an illustration of phased mutations on the dual ASD for tumorGBM-06-0145. 7 regions are encircled on these plots: (a) majority-phasedto an amplified allelic state yet presents with MAFc suggesting themutation is only on one of the two copies, (b) majority-phased to anamplified allelic state with MAFc suggesting mutation is present on bothamplified copies, (c) majority-phased with allele fraction consistentwith the LOH allelic state, (d) minority-phased in themajority-amplified allelic state with MAFc consistent with single copy,(e) unphased mutations with MAFc consistent with amplified allelicstate, and (f) unphased mutations with MAFc consistent with LOH allelicstate

FIG. 14 is an illustration of phased mutations on the dual ASD for tumorLUSC-34-2596. The two encircled regions, (a) and (b), show a number ofmutations phased to the majority and minority alleles, respectively, inthe balanced amplified allelic state (2,2). One majority-phased mutationin NDRG1 is found in a transitional allelic state with matching MAFc.The locations of two missense mutations, in BRAF & DNMT3A, and onenonsense mutation in TP53 are shown in the unphased plot, placing BRAFin a highly amplified allelic state, DNMT3A in the “normal” allelicstate, and TP53 in the CN-LOH state.

DETAILED DESCRIPTION

The inventors have discovered that clonality of a geneticallyheterogeneous sample can be readily resolved using an approach that usesan allelic state model (e.g., expressed as allelic state diagram), andthat the so obtained clonality information can be used for variouspurposes, including analytic, prognostic, and diagnostic uses.

For example, the methods and systems contemplated herein provide theability to computationally dissect a tumor's population using wholegenome sequencing data, and where desired, to visually assess a tumorsample's clonality using an allelic state diagram. Viewed from adifferent perspective, the clonal mixture of a tumor can now bedetermined by decomposition of the tumor population into the majorclones of the tumor cell population and by estimation of normalcontamination to account for the copy number and allele fraction (whichis preferably performed using BamBam, as described in WO2013/074058).Still further, contemplated systems and methods all for a determinationand phasing of whole genome karyotypes of all major clones, which inturn allows inferring the phylogenetic tree of polyclonal tumor genomesto time the emergence of clone-specific copy number alterations.Finally, by using phasing and mutant allele fraction, emergence ofmutations can be timed with respect to their encompassing copy numberalterations.

Therefore, in one aspect of the inventive subject matter, it should beappreciated that clonality and timing information will help betterunderstand the dynamic nature of individual tumors, which may bereflective of a tumor type, or an individual's or tissue's response tothe presence or development of the tumor. Remarkably, all of thisinformation can be discovered from just a single tumor biopsy, makingcontemplated systems and methods particularly useful in an ex vivodiagnostic approach.

In another aspect of the inventive subject matter, it should beappreciated that the phylogenetic-based mutation models contemplatedherein can be employed to analyze the mutations of related samples(e.g., primary tumor and its metastases) to so reconstruct themutational history of a cancer as it spread. The ability to determine atumor's clonality and identify all of the major clones that comprise thegrowing tumor mass, all from the whole genome sequencing data of asingle biopsy, opens up a wide variety of potential clinicalapplications. For example, in a scenario where a newly-diagnosedpatient's tumor is biopsied and all of the major clones are identifiedvia clonal analysis. A clinician could then use this clonality analysisto tailor the patient's treatment according to the alterations specificto the clone furthest up the evolutionary tree, the progenitor tumorcell, with the hope that treating the initial tumor mass, derivativeclones will also be targeted. On the other hand, in a scenario where apatient is diagnosed with a slow-growing tumor that can be safelymonitored for a longer period of time before surgery or the beginning ofchemotherapy, clonal analysis of a series of biopsies could be performedand, by tracking the clonal composition of the tumor through time, aclinician can identify the clones that are growing most rapidly. Bydesigning a treatment that targets not what is currently the dominantclone, but the clone that is set to become the dominant clone, mightmore effectively treat the cancer.

Clonality analysis is also contemplated to prove useful in betterunderstanding the metastatic spread of cancers. In such scenario, clonalanalysis of a primary tumor and a series of metastases is used todetermine all of the major clones present in the spreading tumor. Byinspecting the clonal composition of the primary and each metastasis,one can determine how each clone spreads and discover if one or moreparticular clones exist that show increased metastatic potential. Bydetermination of the characteristics unique to the metastatic clones,the inventors contemplate that identification of emergence of thesecharacteristics in minor clones of another patient's primary tumor an“early warning” signal may be developed for determination of likelihoodof imminent metastasis.

With respect to methods of data acquisition for clonality analysis, itis preferred that genomic analysis to identify copy number and allelefractions are determined using systems and methods in which multiplerelatively small genomic sequence sub-strings (e.g., short reads fromsequencing runs) of respective larger genetic sequence strings from afirst and second tissue sample (e.g., healthy and diseased tissue) areobtained. The genetic sequence strings are then incrementallysynchronized using one or more known positions of at least one ofcorresponding sub-strings to so produce a local alignment. The sogenerated local alignment is then analyzed (typically using a referencegenomic sequence) to generate a local differential string between thefirst and second sequence strings within the local alignment that thuscontains significant differential information (typically relative to thereference genomic sequence). A differential genetic sequence object fora portion or even the entire genome is then created using the localdifferential string, and most typically a plurality of localdifferential strings. It should be noted that incrementalsynchronization to produce local alignments and differential informationprovides various technical advantages, including a significant increasein processing speed of an entire genome, as well as the capability toproduce allele specific information (e.g., copy number, allele fraction,etc.)

In such systems and methods, it should be appreciated that instead ofprocessing two extremely large files to generate another extremely largeintermediate (or even output) file, genome wide analysis can be achievedin multiple significantly smaller portions wherein the smaller portionsare aligned to a reference genome using known positions within thegenome of one or more sub-strings. Viewed from another perspective,alignment is performed by incremental synchronization of sequencestrings using known positions of substrings and a reference genomesequence, and an output file can be generated that comprises onlyrelevant changes with respect to a reference genome. Thus, theprocessing speed is significantly improved and the amount of datarequired for production of a meaningful output is dramatically reduced.Still further, it should be noted that such systems and methods allow,inter alia, haplotyping/somatic and germline variant calling, anddetermination of allele-specific copy numbers. Moreover, the systems andmethods presented herein are suitable for use with sequence informationin SAM/BAM-format.

For example, multiple sequencing fragments (e.g., short reads from atumor sample of a donor and corresponding non-tumor sample of the samedonor) are aligned to the same reference genome, which is employed toorganize the sequencing fragments from the samples. Thus, such methodsuse two sequencing fragment datasets (one from the tumor, the other fromcorresponding normal “germline” tissue) from the same patient and thereference genome, and reads the datasets such that all sequences in bothdatasets overlapping the same genomic position (based on the referencegenome and annotation in sub-strings) are processed at the same time.This is the most efficient method for processing such data, while alsoenabling complex analyses that would be difficult or impossible toaccomplish in a serialized manner, where each dataset is processed byitself, and results are only merged afterwards. A particular suitablesystem is described in WO2013/074058, incorporated by reference herein.

With the release of multiple fully-sequenced tumor and matched normalgenomes from projects like The Cancer Genome Atlas (TCGA), there isgreat need for tools that can efficiently analyze these enormousdatasets.

To this end, we developed BamBam, a tool that simultaneously analyzeseach genomic position from a patient's tumor and germline genomes usingthe aligned short-read data contained in SAME AM-formatted files(SAMtools library; Li H, Handsaker B, Wysoker A, Fennell T, Ruan J,Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project DataProcessing Subgroup. The Sequence Alignment/Map format and SAMtools.Bioinformatics. 2009 Aug 15; 25(16):2078-9. Epub 2009 Jun 8). BamBaminterfaces with the SAMtools library to simultaneously analyze apatient's tumor and germline genomes using short-read alignments fromSAM/BAM-formatted files. In the present disclosure the BamBam tool canbe a sequence analysis engine that is used to compare sequences, thesequences comprising strings of information. In one embodiment, thestrings of information comprise biological information, for example, apolynucleotide sequence or a polypeptide sequence. In anotherembodiment, the biological information can comprise expression data, forexample relative concentration levels of mRNA transcripts or rRNA ortRNA or peptide or polypeptide or protein. In another embodiment, thebiological information can be relative amounts of protein modification,such as for example, but not limited to, phosphorylation, sulphation,actylation, methylation, glycosilation, sialation, modification withglycosylphosphatidylinositol, or modification with proteoglycan.

This method of processing enables BamBam to efficiently calculateoverall copy number and infer regions of structural variation (forexample, chromosomal translocations) in both tumor and germline genomes;to efficiently calculate overall and allele-specific copy number; inferregions exhibiting loss of heterozygosity (LOH); and discover bothsomatic and germline sequence variants (for example, point mutations)and structural rearrangements (for example, chromosomal fusions.Furthermore, by comparing the two genome sequences at the same time,BamBam can also immediately distinguish somatic from germline sequencevariants, calculate allele-specific copy number alterations in the tumorgenome, and phase germline haplotypes across chromosomal regions wherethe allelic proportion has shifted in the tumor genome. By bringingtogether all of these analyses into a single tool, researchers can useBamBam to discover many types of genomic alterations that occurredwithin a patient's tumor genome, often to specific gene alleles, thathelp to identify potential drivers of tumorigenesis.

To determine if a variant discovered is somatic (that is, a variantsequence found only in the tumor) or a germline (that is, a variantsequence that is inherited or heritable) variant requires that wecompare the tumor and matched normal genomes in some way. This can bedone sequentially, by summarizing data at every genomic position forboth tumor and germline and then combining the results for analysis.Unfortunately, because whole-genome BAM files are hundreds of gigabytesin their compressed form (1-2 terabytes uncompressed), the intermediateresults that would need to be stored for later analysis will beextremely large and slow to merge and analyze.

To avoid this issue, BamBam reads from two files at the same time,constantly keeping each BAM file in synchrony with the other and pilingup the genomic reads that overlap every common genomic location betweenthe two files. For each pair of pileups, BamBam runs a series ofanalyses listed above before discarding the pileups and moving to thenext common genomic location. By processing these massive BAM files withthis method, the computer's RAM usage is minimal and processing speed islimited primarily by the speed that the file system can read the twofiles. This enables BamBam to process massive amounts of data quickly,while being flexible enough to run on a single computer or across acomputer cluster. Another important benefit to processing these fileswith BamBam is that its output is fairly minimal, consisting only of theimportant differences found in each file. This produces what isessentially a whole-genome diff between the patient's tumor and germlinegenomes, requiring much less disk storage than it would take if allgenome information was stored for each file separately.

BamBam is a computationally efficient method for surveying largesequencing datasets to produce a set of high-quality genomic events thatoccur within each tumor relative to its germline. These results providea glimpse into the chromosomal dynamics of tumors, improving ourunderstanding of tumors' final states and the events that led to them.

BamBam enables a rapid comparison of tumor (somatic) and germlinematched sequencing datasets. The results output by BamBam are varied,producing an exhaustive catalogue of the somatic and germline variantscontained by each patient's samples. This catalogue provides researcherswith the ability to quickly find important changes that occurred duringthe tumor's development, but also provide high-quality variants presentin the patient's germline that may indicate predisposition to disease.Further improvements of BamBam will consist of methods that specificallysearch for multiple types of variants occurring in the same genomicregion (for example, one allele of a gene deleted, the other allelecontaining a truncating mutation by breakpoint) that may point todrivers of tumorigenesis. We also plan to extend BamBam's ability toprocessing more than pairs of genomes, as well as provide researcherswith the ability to plug in their own analysis methods into BamBam'spipeline.

In additional embodiments, the polynucleotide nucleic acids may be usedin any molecular biology techniques that have yet to be developed,provided the new techniques rely on properties of nucleic acid moleculesthat are currently known, including, but not limited to, such propertiesas the triplet genetic code and specific base pair interactions.

In a genetic sequence analysis ecosystem, which includes sequenceanalysis engine coupled with one or more databases, possibly overnetwork (for example, LAN, WAN, VPN, Internet, etc.). Preferreddatabases include genetic database storing genetic sequence strings forone or more tissues, differential sequence database storing differentialgenetic sequence objects representing local differential strings, andmedical records database storing one or more medical records associatedwith a patient, person, population, or other type of entities. Medicalrecords database can also store one or more differential geneticsequence objects, possibly associated with patients, persons,populations or other groups.

One aspect of the inventive subject matter is considered to includemanagement of differential genetic sequence objects. Through analysis ofgenetic sequence strings, analysis engine can create differentialstrings or constellations of differential strings. Differential stringscan be converted to differential genetic sequence objects, which in turncan be stored in differential sequence database or medical recordsdatabase. The sequence objects can be tagged with one or more attributesdescribing the nature of the objects. Example attributes can includetime stamps of object creation, time stamp of when sample was taken froma patient, patient name, demographic information, tissue type (forexample, healthy, diseased, tumor, organ tissue, etc.), or otherfeatures. The attributes can by leveraged by analysis engine toestablish one or more correlations among characteristics associated withmedical records in medical records database.

Management of differential genetic sequence objects covers a broadspectrum of roles or responsibilities. As discussed above, one aspectincludes creation of such objects. Analysis engine is also preferablyconfigured to update, analyze, modify, track in time, delete, copy,split, append, or otherwise manipulate the sequence objects as desired.Further, analysis engine can provide a differential genetic sequenceobject management interface, possibly on output device. For example, insome embodiments, ecosystem operates as a for-fee service comprising oneor more web servers available over the Internet. In such an embodiment,a computer with a browser can interface with analysis engine to manageor interact with the differential genetic sequence objects.

In some embodiments, as discussed further below, analysis engine isconfigured to analyze genetic sequence strings obtained from geneticdatabase. Preferably the genetic sequence strings are associated withinat least two different tissue samples. Analysis engine produces one ormore local alignments by incrementally synchronizing at least twosequences using at least a known position of corresponding sub-stringsin the sequence strings. Further, analysis engine uses the localalignment to generate one or more local differential strings orconstellations of differential strings between the genetic sequencestrings. Analysis engine can then use the differential strings to updatedifferential genetic sequence objects in differential sequence databaseor medical records database. The differential sequence objects can thenbe used for further analysis.

In some embodiments, analysis engine communicatively couples withmedical records database that stores differential genetic sequenceobjects for specific patients, persons, individuals, families,populations, or other groups. Analysis engine obtains a differentialsequence object for a patient and produces a patient specific data setbased on presence of a local differential string or constellation ofdifferential string associated with the patient's sequence object. Then,analysis engine can leverage the patient-specific data set to generateor otherwise produce one or more patient specific instructions. Forexample, through analysis of the patient's specific local differentialstrings, analysis engine can determine if there is a correlation betweenthe patient's specific differential strings and known conditions, whichin turn can be mapped to instructions. Contemplated instructions caninclude a diagnosis, a prognosis, a recommended treatment, a prediction,a prescription, or other type of instructions.

In yet other embodiments, analysis engine obtains differential geneticsequence objects stored in medical records database where the sequenceobjects are associated with a population of individuals. The analysisengine identifies a constellation of local differential strings frommultiple sequence objects and generates constellation record from theconstellation. Constellation record comprises a representation ofinformation (for example, attributes, properties, metadata,characteristics, etc.) related to local differential strings associatedwith the population Analysis engine uses constellation records togenerated population analysis record. Thus, the differential geneticsequence objects can be mapped to population segments.

Still another embodiment includes analysis engine using the differentialgenetic sequence object to determine an extent that a person's geneticsequence deviates from a reference sample. A reference differentialgenetic sequence object, possibly representing a real person or acanonical person, can be stored as a medical record in medical recordsdatabase. Analysis engine calculates a deviation between a person'slocal differential strings from different sequence objects associatedwith the person and the local differential strings from the referencedifferential genetic sequence object. Once the deviation is calculated,analysis engine generates a deviation record representing the deviationor departure. Similar to other records in the system, deviation recordcan also include attributes reflecting the characteristics of theinformation in the record (for example, person name, time stamps, sampletypes, etc.). Analysis engine can then leverage deviation record togenerate person-specific deviation profile indicating how or to whatdegree the person genetic sequences deviate from the referencedifferential stings.

Regardless of the type of analysis or result generated (for example,patient instructions, population analysis, person-specific profile,etc.), analysis engine can further configuration output device topresent the result. Output device preferably comprises a computingdevice coupled with analysis engine, possibly over network. Examples ofoutput device include cell phones, information kiosks, computerterminals at point of care, insurance company computers, printers,imaging devices, genomic browsers, or other types of devices.

Using a system according to the inventive subject matter will thereforetypically include a genetic database. As already noted above, it shouldbe appreciated that the genetic database may be physically located on asingle computer, however, distributed databases are also deemed suitablefor use herein. Moreover, it should also be appreciated that theparticular format of the database is not limiting to the inventivesubject matter so long as such database is capable of storing andretrieval of first and second genetic sequence strings representingrespective first and second tissues, wherein the first and secondsequence strings have a plurality of corresponding sub-strings.

Likewise, it should be noted that the particular format of the first andsecond genetic sequence strings is not limiting to the inventive subjectmatter so long as first and second genetic sequence strings will includeone or more corresponding sub-strings for which the location in a genomeis known. Therefore, suitable data formats will include simple ASCII orbinary code, and the sequence strings may be formatted followingspecifications commonly employed in currently known sequence analytictools. Therefore, especially preferred formats include EMBL, GCG, fasta,SwissProt, GenBank, PIR, ABI, and SAM/BAM format.

Depending on the particular nature of analysis and samples, the type ofgenetic sequence strings may vary considerably, and it should be pointedout that the sequences may be nucleic acid sequences (DNA or RNA) aswell as protein sequences. Most typically, however, the genetic sequencestrings will be nucleic acid strings that will represent significantportions of the genome, transcriptome, and/or proteome of the first andsecond tissues under analysis. For example, it is contemplated that thefirst and second genetic sequence strings represent at least 10%, moretypically at least 25%, more typically at least 50%, even more typicallyat least 70%, and most typically at least 90% or even substantially theentire (at least 98%) genome, transcriptome, or proteome of the firstand second tissues. Thus, it should be appreciated that the systems andmethods presented herein will allow for a rapid and highly comprehensiveoverview of significant differences between first and second tissueswhile producing a compact and informative output file.

Depending on the type of tissue under investigation, it should be notedthat multiple types of analyses can be performed. For example, where thefirst and second tissues originate from the same biological entity,healthy tissue may be compared against a different healthy tissue orhealthy tissue may be compared against a corresponding diseased tissue(for example, tumor tissue). Thus, the biological entity may be ahealthy individual or an individual diagnosed with a disease ordisorder. On the other hand, where first and second tissues are derivedfrom a cell line (immortalized or primary), genetic effects orepigenetic effects of drugs may be rapidly identified. Similarly, wherethe first and second tissues are derived from a stem cell, changes ingenetic composition or genetic plasticity of the developing embryo maybe analyzed. In still further contemplated examples, the first andsecond tissue may be of an experimental animal model to investigateprogression of a disease or effect of a treatment. Alternatively, firstand second tissue may even be from a yeast, recombinant bacterial cell,and/or a virus.

Consequently, it should be recognized that the nature of thecorresponding sub-strings will vary considerably and will at least inpart depend on the type of tissue sampled and on the amount of genomiccoverage. However, it is typically preferred that the genomic coverageis relatively high and that in most cases the entire genome is analyzed.Thus, corresponding sub-strings will typically include homozygous andheterozygous alleles.

Regardless of the type of sub-strings, it is generally preferredsynchronizing will include a step of aligning at least one of theplurality of sub-strings based on an a priori known location within thefirst string. As numerous genomes for various organisms (and especiallyhuman) are already substantially completely annotated and as evenunknown sequences are often annotated with at least a putative function,and as substantially the (linear) sequence entire genomes are known, thenumber of priori known locations with respect to a reference genome ishigh. Thus, knowledge of annotations within the reference genome willserve as a roadmap for effective and accurate synchronization. Ofcourse, it should be appreciated that the nature of the reference genomeis not necessarily limited to a genome of a single healthy tissue, butthat the reference genome may be any defined (actual or calculated)genomic structure. For example, the reference genome may be constructedfrom a (typically single tissue of a) plurality of healthy individualsto so generate a consensus reference sequence. Alternatively, thereference string may be based on a consensus of multiple tissues of thesame (or different) individual, or on a consensus of diseased tissuesamples (from the same or multiple patient).

Consequently, it should be recognized that the differential geneticsequence object will provide information of one or more sample tissue(s)relative to a reference tissue. Thus, and depending on the choice of thereference string, the information content for the differential geneticsequence object may vary considerably. For example, the differentialgenetic sequence object may provide information that the sample is amatch for a particular subpopulation (as defined by the referencestring) or that the sample has a plurality of mis-matches that may ormay not be associated with a disease or condition.

In further preferred aspects of the inventive subject matter, thesynchronization may also be performed by aligning the sub-string(s)within a window having a length of less than a length of the at leastone of the plurality of sub-strings. Most preferably, synchronization isperformed by iteratively and incrementally synchronizing the first andsecond sequence strings throughout the entire length of the firstsequence string. Viewed from a different perspective, synchronizing willthus be performed in a manner similar than that of a zipper in which thetwo halves are incrementally matched up to produce an alignment. Usingthe same image, only mis-matched portions of the closed zipper are thenreflected in the differential genetic sequence object.

Consequently, it should thus be recognized that the differential geneticsequence object will represent one or more local differential strings,typically at least for a defined portion of the genome (for example, atleast one chromosome), and more typically for substantially the entiregenome of the first or second tissue. Of course, it should be noted thatbased on the already known position and/or determined deviation from thereference string, the differential genetic sequence object willtypically include one or more attributes with metadata describing thedifferential genetic sequence object. For example, the attribute may bedescriptive of a state of the first and/or second tissues. Where thestate is a physiological state, the metadata may reflect neoplasticgrowth, apoptosis, state of differentiation, tissue age, and/orresponsiveness to treatment for the tissue. On the other hand, where thestate is a genetic status, the metadata may reflect ploidy, gene copynumber, repeat copy number, inversion, deletion, insertion of viralgenes, somatic mutation, germline mutation, structural rearrangement,transposition, and/or loss of heterozygosity. Similarly, the state mayinclude pathway model information that is associated with a signalingpathway within the tissues (for example, anticipated responsiveness todrugs, defects in receptors, etc.), and especially contemplated pathwaysinclude signaling pathways (for example, growth factor signalingpathway, transcription factor signaling pathway, apoptosis pathway, cellcycle pathway, hormone response pathway, etc.).

Output information provided by the systems and methods presented hereinmay be in form of a single differential genetic sequence objectindicating multiple deviations from the reference string, or more thanone differential genetic sequence object indicating individualdeviations from the reference string, or any reasonable combinationthereof. Most typically, the differential genetic sequence object willbe in electronic format, and thus be retrieved and/or transferred as acomputer readable file. As will be readily recognized the file is mostpreferably standardized, and it is especially preferred that the formatconforms to a SAM/BAM format.

In light of the above, it should thus be appreciated that thedifferential genetic sequence object may be used in a variety ofmanners, and that the differential genetic sequence object is especiallysuitable for numerous applications in healthcare, population analysis,and personalized medicine.

For example, where one or more differential genetic sequence objects areknown for an individual, a patient-specific data set may be producedthat is based on a local differential string or on a constellation ofmultiple local differential strings in the differential genetic sequenceobject for the patient, and the patient-specific data set is then usedto produce a patient-specific instruction. In a typical example, theinventors contemplate a method of providing a health care service inwhich an analysis engine is coupled to a medical records storage devicethat stores a differential genetic sequence object for a patient. Theanalysis engine will then generate patient-specific data using one ormore local differential strings or a constellation of a plurality oflocal differential strings in the differential genetic sequence objectfor the patient, and produce a patient-specific instruction based on thepatient-specific data set.

It should be appreciated that the medical records storage device may beconfigured in numerous manners and may be portable by the patient (forexample, smart-card carried by the patient), accessible by the patient(for example, via smart phone), or remotely stored on a server that isaccessible by the patient or medical professional of the patient. As canbe taken from the discussion above, the differential genetic sequenceobject for the patient may include any number of local differentialstrings (i.e., sequence deviations at a specific position in the genomerelative to a reference genome), and the local differential strings maybe located in a defined area of the genome, on or more chromosomes, oreven in throughout the entire genome. Similarly, the differentialgenetic sequence object may comprises multiple local differentialstrings that represent at least two tissue types (for example, healthyversus diseased), or at least two temporally spaced results for the sametissue (for example, prior to treatment with a particular drug at aparticular regimen and after treatment commences).

Thus, and viewed from a different perspective, it should be noted thatmedically relevant information for the entire genome (or a fractionthereof [for example, chromosome or contiguous sequence stretch]) can beexpressed as a deviation record having one or more local differentialstrings, and that the information can be used to compare against adatabase that contains treatment options, diagnoses, and/or prognosesassociated with or for the local differential string. Where multiplelocal differential strings are present, it is noted that the combinationof selected local differential strings may be indicative of a condition,predisposition, or disease, and that such constellation of multiplespecific local differential strings may be used to generate thepatient-specific data, which is then used to generate thepatient-specific instruction. Thus, the nature of the patient-specificinstruction will vary considerably, and may be a diagnosis, a prognosis,a prediction of treatment outcome, a recommendation for a treatmentstrategy, and/or a prescription.

In yet another preferred use of contemplated differential geneticsequence objects, the inventors discovered that genetic analysis ispossible not only for individuals, but that also population-wideanalyses can be conducted in a rapid and effective manner using thesystems and methods presented herein. For example, in a method ofanalyzing a population, a plurality of differential genetic sequenceobjects (for example, for a plurality of individuals) are stored in amedical records database of a population, and an analysis engine willidentify a constellation of a plurality of local differential strings(for example, based on polymorphisms, epigenetic changes, etc.) withinthe plurality of differential genetic sequence objects to produce aconstellation record, which is then used to generate a populationanalysis record.

For example, the constellation record can be prepared for bloodrelatives, members of the same ethnic group or race, a populationworking in the same occupation, a population living in a selectedgeographic location. Alternatively, the population may also be definedby having members that share exposure to a pathogen or noxious agent,health history, treatment history, treatment success, gender, species,and/or age. Thus, it should be recognized that the constellation recordis a genome-wide analytic tool that will allow identification ofindividuals as belonging to one or more specific groups as defined bythe constellation record. Thus, the constellation record and associatedmethods may be useful to determine paternity or maternity or may beuseful to generate a patient-specific record in view of theconstellation record. For example, the patient-specific record mayreveal predisposition to a disease or condition, or sensitivity tocertain drugs or other agents. Consequently, the patient-specific recordmay present a risk assessment and/or an identification of the patient asbelonging to a specified population. Alternatively, the patient-specificrecord may include a diagnosis, a prognosis, a prediction of treatmentoutcome, a recommendation for a treatment strategy, and/or aprescription that is typically at least in part based on a comparison ofthe constellation record of the patient with a population analysisrecord.

In a still further preferred use of contemplated differential geneticsequence objects, a reference differential genetic sequence object isgenerated (for example, as a consensus record as described above) andstored in a database. A deviation between a plurality of localdifferential strings in the differential genetic sequence object of aperson and a plurality of local differential strings in the referencedifferential genetic sequence object is then determined to so produce anindividual deviation record for that person, which can then be used togenerate a person-specific deviation profile. Thus, instead of using oneor more physiological parameters (for example, common CBC ordered by aphysician), a differential genetic sequence object for (preferably) theentire genome of a person is compared to a reference differentialgenetic sequence object to so arrive at a significantly morecomprehensive collection of information. Most typically, theperson-specific deviation profile is then matched against normal orreference records for reference differential genetic sequence objects toso accurately and quickly identify the person as matching a specificcondition or disease.

Fundamental Considerations

At first approximation, a tumor growth is a population of cancer cells.This population may homogenous, where all tumor cells sharesubstantially the same genetic characteristics. Such tumors are said tobe monoclonal since all tumor cells feature substantially the samegenetic variants (e.g., copy number aberrations, structural variants,mutations) as compared to the progenitor tumor cell from which the tumorcells propagated. This progenitor tumor cell may be the first cancerouscell that initiated the tumor, or may be a subsequent tumor cell thatgained an advantageous mutation that aided a complete sweep of the tumorpopulation.

On the other hand, polyclonal tumor growths are viewed as tumorscomposed of at least two genetically distinct clonal populations oftumor cells. In polyclonal tumors, each clonal population arose from arespective progenitor clone, but each progenitor clone differs from theother by some observable alteration. Thus, the multiple clonalpopulations may be significantly different from each other, or (as ismore often the case), the clonal populations are related, sharing a setof variants that are found in all or a large subset of tumor cells. Forexample, a polyclonal tumor may comprise multiple major clones, where amajor clone represents a computationally detectable clone (typicallyrepresenting 10% of the tumor population), while the same polyclonaltumor may comprise further numerous minor clones that are undetectablewith any given method.

In addition, it should be noted that individual mutations may beclassified as either clonal or subclonal. In that context, when thedominant clones of a particular tumor are found, clonal variants arethose shared by all tumor cells of any or all dominant clones. Viewedfrom a different perspective, clonal variants achieved full penetrancein the entire population or polyclonal subpopulation of cells. Subclonalvariants are those that exist in only a small proportion of the cellsbelonging to a clonal population.

An example for the above model of tumor and its evolution is provided inFIG. 1 in which an initial germline cell acquired a nonsense mutation ina key tumor suppressor (M1) and amplified an oncogene (A1) thatsupported the initial growth of a tumor. Early on in this tumor'sdevelopment, another tumor suppressor was deleted (D1) that caused thetumor cell to grow even more rapidly, enabling cells with this deletionto rapidly overtake the entire tumor population. Soon after acquiringdeletion D1, a cell also acquires a set of neutral mutations (M2, M3),amplifications (A2, A3), deletions (D2, D3). Since these variantsoccurred early during the clonal expansion of this tumor cell variant,but do not provide any selective advantage, the population of tumorcells are split into two “major clones,” where 25% of tumor cells havethe neutral variants (M2, M3, A2, A3, D2, and D3) and 75% of tumor cellsdo not. Much further during this tumor's development, additionalmutations (M4, M5) appear on one of the two major clones, but do nothave a chance to spread through the population prior to the patient'sdeath and/or tissue biopsy.

In the example of FIG. 1, the tumor population is polyclonal, with itstwo major clones defined such that: clone (1) has variants M1, A1, andD1, and clone (2) shares the variants of clone (1), but in addition hasvariants M2, M3, A2, A3, D2, and D3. The clonal mixture is determined as75% clone (1) and 25% clone (2). Mutations M1, M2, and M3 would all beclassified as “clonal” since they all achieved full penetrance in theirrespective clones, while M4 and M5 would be classified as “subclonal”mutations. Moreover, as can be seen from FIG. 1, a biopsy will typicallyinclude normal tissue in addition to tumor heterogeneous tissue.

Data Extraction and Synthesis

The following presents various systems and methods to extract andsynthesize data to reconstruct the clonal evolution of a tumor fromwhole genome sequencing data of a single tumor biopsy. These systems andmethods provide a powerful framework to determine the clonality of atumor, the number and proportion of all major clones in the tumor, andpossible variants that distinguish the major clones. Furthermore,systems and methods are presented to phase mutations to parental allelesto thereby time their emergence within the population. In addition,contemplated systems and methods will provide an accurate estimate ofthe amount of contaminating normal tissue that was present in a tumorbiopsy.

Copy Number Alterations, Allele Fraction, and the Allelic State Diagram

To discover and describe the major clones of a population, relative copynumber and allele fraction estimates are utilized. Such data can beobtained using algorithms and methods as described in WO2013/074058.Underlying the method to determine both clonality and estimate normalcontamination is the “allelic state diagram” (ASD), which is describedin more detail below. It should be especially appreciated that the ASDdescribes the positions of clonal positions of allele-specific copynumber variants using both relative copy number and allele fraction ofcopy number alterations, thus demonstrating the relationship betweencopy number and allele fraction for all allelic states. The positions ofclonal allelic states in the ASD are determined by the followingEquations I and II:

$\begin{matrix}{{{CN}( {t_{maj},t_{\min},n_{maj},n_{\min},\alpha} )} = \frac{{( {1 - \alpha} )( {t_{maj} + t_{\min}} )} + {\alpha( {n_{maj} + n_{\min}} )}}{n_{maj} + n_{\min}}} & {{Eq}.\mspace{11mu} I} \\{{{AF}( {t_{maj},t_{\min},n_{maj},n_{\min},\alpha} )} = \frac{{( {1 - \alpha} )t_{maj}} + {\alpha\; n_{maj}}}{{( {1 - \alpha} )( {t_{maj} + t_{\min}} )} + {\alpha( {n_{maj} + n_{\min}} )}}} & {{Eq}.\mspace{11mu}{II}}\end{matrix}$where CN is the relative tumor copy number compared to a matched-normal,AF is allele fraction in the tumor, a is the fraction of normalcontamination in the tumor sample, and t_(maj), t_(min), n_(maj), andn_(min) are the majority and minority allelic states in the tumor andnormal, respectively. Since individual genomes can only have discreteallelic states, such that they have 0, 1, 2, or more copies of a givenchromosomal segment, the possible values for t_(maj) and t_(min) areconstrained to the set of positive integers, tiϵ(0, 1, 2, . . . , n).Furthermore, the majority and minority allelic states for the normal areset to one, n_(i)=1, which is true for all of the autosomes in a normalhuman genome. The sex chromosomes, X and Y, are ignored in the ASD. Notethat since the above formulae necessarily require two alleles, onlyheterozygous sites in the matched normal genome are considered for theASD.

In the following figures, particularly significant allelic states arenormal copy number, single-copy amplification, single-copy/hemizygousdeletion, homozygous deletion, copy-neutral loss of heterozygosity(CN-LOH), and amplification of both parental alleles. For example, FIG.2 shows exemplary copy number and allele fraction data for the aboveallelic states with no normal contamination, demonstrating how the ASDcan be used to determine the allelic state of each cluster of points.Here, each vertex in the ASD's grid is labeled with its tumor allelicstate, (t_(maj), t_(min)), and the position is determined by theequations above. FIG. 3 demonstrates how the locations of allelic statesare affected by increasing amounts of normal contamination, α. FIG. 3Ahas no normal contamination (α=0), with FIGS. 3B-D having increasednormal contamination (3B: α=0.1; 3C: α=0.5; 3D: α=0.9). As is readilyapparent, as normal contamination increases, the allelic state positionsgrow closer together, reducing the ability to resolve different allelicstates. It should be especially noted that plotting the copy numberversus allele fraction to produce an ASD provides various technicaladvantages, including the capability to observe and identify clonalitystatus of a tumor and the capability to observe and identify(unidirectional and bidirectional) changes in the clonality status of atumor.

It should be noted that the example of FIG. 3 depicts a static snapshotof a monoclonal tumor. However, it is well known that the tumor genomecan be very dynamic, with gains and losses of small and largechromosomal segments. FIG. 4 exemplarily illustrates some of thepossible transitions between the allelic states described in previousfigures. It should be appreciated that some transitions are “one-way”since they involve the irreversible loss of chromosomal segments. Forexample, the transition between the normal allelic state (1,1) and thehemizygous deletion state (1,0) is “one-way” because that deleted allelecan never be restored. However, the retained allele in this case can beamplified, permitting transitions to the copy-neutral LOH (CN-LOH) stateand beyond (2+,0). Notice that the deletions necessary for thetransition between other allelic states are not deemed “one way” becauseat least one copy remains of each allele remains in the genome.

Based on the above, it should be recognized that allelic states can nowbe identified in a relatively simple manner. For example, FIG. 5displays the ASD of a tumor genome transitioning from the allelic statespresented in the previous figures to new allelic states that differ onlyby a single copy loss or gain. During such a transition, the populationof tumor cells will be a mixture of tumor cells having tumor cells withthe original allelic states and tumor cells with the new allelic states.For the example presented in FIG. 5, one could view this “transitional”tumor as a population divided between two major clones, A and B, whereclone A is defined by the original allelic states, and clone B isdefined by the new allelic states. The mixture fractions shown on thisfigure, M_(b), represents the fraction of clone B within the population,such that the tumor population solely consists of clone A when M_(b)=0,and the population consists only of clone B when M_(b)=1. It isimportant to note that the allelic states for both clones, t_(i,a) andt_(i,b), are still constrained to the set of positive integers.

From FIG. 5, when the mixture fraction M_(b) is such that the tumor is aheterogenous population of cells, M_(b)=0.25, 0.5, 0.75, the allelicstates do not lie on the vertices of the ASD, but rather on the edgeconnecting two vertices. A tumor population in such a state would beclassified as polyclonal. Take, for example, the cluster of points inFIG. 4. In this region of the genome, clone A has the allelic state ofhemizygous deletion, or (1, 0), whereas clone B has amplified clone A'sretained allele, altering its allelic state in this region tocopy-neutral LOH, or (2, 0). When M_(b)=0, the allelic state of the redpoints are found clustered on the ASD vertex representing the hemizygousdeletion allelic states. As M increases (i.e. with increasing amounts ofclone B in the population), the cluster of points progresses along theedge towards the CN-LOH state. At M_(b)=0.5, where there are equalamounts of clone A and B in the population, the cluster of points isfound precisely in the middle of the edge between the LOH and CN-LOHallelic states.

If the tumor population comprises non-derivative clones, or clones thatare distantly related to one another such that their allelic states donot differ by single copy gains or losses, the position of the mixtureof allelic states will not lie along the edges of the ASD, as shown inFIG. 6. As will be discussed in more detail below, such abnormal allelicstates can also occur when more than 2 major clones exist in apolyclonal tumor. Thus, it should be recognized that the ASD can, at aglance, indicate the presence of one or more major clones in a tumorsample, help determine the allelic states of the major clones, andprovide a visual estimate of the proportion of each major clone in thetumor population, rendering the ASD a powerful diagnostic tool fordetermining clonality of a tumor sample. Moreover, it should beappreciated that plotting the copy number versus allele fraction toproduce an allelic state diagram advantageously allows determination ofmixture fractions in non-monoclonal related/derivative orunrelated/non-derivative tumors.

Fitting Sequence Data to the ASD

The mathematical construct behind the ASD is expressed in Equation I andII above is modeling the ideal case where the relative copy number is1.0 and the majority allele fraction is 0.5 for normal (1, 1) allelicstates. However, the results produced by sequence analysis on real worlddata do often not precisely fit this idealized case. To estimate therelative copy number, sequence analysis (e.g., as described inWO2013/074058) calculates the relative coverage between tumor andnormal. If the tumor and normal samples are sequenced at the samecoverage level, relative coverage is an accurate measure of relativecopy number. However, this will not be the case if the tumor sample issequenced at a much higher coverage than its matched-normal, in anattempt to improve detection of mutations, particularly subclonalmutations, in the tumor sample.

For example, and assuming no normal contamination, if a tumor issequenced at twice the coverage of its matched-normal, then a regionwith a “normal” allelic state will have twice as many reads in the tumoras it has in the normal. Thus, this region has a relative coverage of2.0 and a relative copy number of 1.0, and the so determined relativecoverage will not fit the ASD. Unfortunately, the precise coverage levelof a given sequencing dataset is unknown, as the sequencing servicesoften only target the desired coverage level, but have no guarantee ofachieving it. Using the raw number of reads found in the tumor andmatched-normal datasets as an estimate of overall coverage level canhelp correct the imbalance, but is complicated by the ploidy of thetumor sample. If a tetraploid tumor (ploidy=4.0) and its matched-normal(ploidy=2.0) are sequenced at the same physical coverage, the tumor willhave two times the number of raw reads than the matched-normal. So,using the ratio of their raw numbers of reads to scale local relativecoverage estimates would this tetraploid tumor to appear to have normalcopy number.

The error in the estimate of allele fraction that sequence analysis(e.g., as described in WO2013/074058) produces is caused by a limitationin how the majority allele is selected in regions of allelic balance,such as the “normal” allelic state. Ideally, the allele fraction forsuch regions should be approximately 0.5, but this only occurs when bothalleles have equal read depths. More often, due to the stochastic natureof how heterozygous alleles are sampled from a pool of genomic DNA, oneof the two alleles will likely have a slightly higher read depth thanthe other, causing a slight increase in the majority allele fractionthat is estimated.

For example, assuming no normal contamination, a whole genome with 30×coverage would ideally produce 15 of both alleles at heterozygous“normal” allelic states. However, if one allele's read depth was shiftedby just a single read, such that allele A has read support of 16, thesequence analysis (e.g., as described in WO2013/074058) would estimatethe majority allele fraction to be 16/30=0.53, a deviation of 0.03 fromthe actual allele fraction. Usually averaging over multiple positionscan reduce the effect of such errors, the error in majority allelefraction for these balanced allelic states cannot be averaged outbecause majority allele fraction, by its definition, can never dip below0.5. Fortunately, sampling error has a much less pronounced effect onamplified and deleted allelic states. In these cases, the majorityallele is readily identifiable and the sampling error can be averagedout over multiple positions.

In order to fit sequence analysis results (e.g., as described inWO2013/074058) onto the idealized ASD, the above errors can be modeledand corrected out from the data. The model has four parameters: normalcontamination α, allele fraction delta AF_(d), coverage delta COV_(d),and coverage scaling factor COV_(E). The α parameter only affects gridlayout of the ASD, as shown in FIGS. 3A-D. The latter three parameterstransform the sequence analysis results. The parameters COV_(d) andCOV_(E) affect the y-axis shift of the copy number data and scale ofcopy number data from the “normal” allelic state according to thefollowing equation:CN _(corr)(CN,COV _(d) ,COV _(s))=COV _(s)(CN−COV _(d))+1.0where CN is the relative copy number estimate produced by the sequenceanalysis and CN_(corr) is the corrected copy number used to compareagainst the ASD. The final parameter, AF_(d) has its strongest effect onthe allele fraction estimates of the allelic balanced states. It doesthis with the following equation:

${{AF}_{corr}( {{AF},{AF}_{d},{CN}_{corr},x} )} = {{AF} - {\frac{{AF}_{d}}{{CN}_{corr}}( \frac{1.0 - ( {{AF} - {AF}_{d}} )}{0.5} )^{x}}}$where x is set to a large integer (e.g. x=20) to rapidly reduce thedegree to which allele fraction estimates are corrected as they divergefrom balanced allelic states. It should be noted that the allelefraction estimates at deleted states should not be appreciably alteredas they are the determining factor for estimating normal contamination.

The optimal values for these four parameters are discovered usinggradient steepest descent search, optimizing the RMSD of the correctedcopy number and allele fraction estimates, CN_(corr) and AF_(corr), tothe ASD defined by the normal contamination parameter, α. The searchbegins with a set of initial values for each parameter, and a set ofincrements for each parameter, COV^(i) _(d), COV^(i) _(s), AF^(i) _(d)and α^(i). For each parameter, p, and parameter increment, p^(i), theRMSD from the ASD is calculated for p, p+p^(i) and p−p^(i). Theparameter value that yields the greatest reduction in RMSD among allfour parameters is chosen as the new current value for that parameter,and the cycle repeats. If no reduction in RMSD is possible with thecurrent parameter increments, the increments are divided by half and thesearch resumes. Once three rounds of divisions have occurred, the searchis concluded and the best fit parameters are reported. Since gradientdescent can often get stuck in a local minimum, gradient search isperformed with a number of different initial parameters until aconsistent set of fit parameters is found. Therefore, it should be notedthat by taking onto account the actual coverage of the sequence reads(e.g., tumor reads versus normal reads) as described above will allow toidentify allelic states even where coverage between tumor and normal isnot identical (or even unclear).

Modeling the Clonal Mixture of a Tumor Sample

The ASD can then be used to determine a set of allelic state “landmarks”that help define the number of distinct clones and their proportionswithin the tumor population, L_(i)=(CN_(corr,i), AF_(corr,i)). Thelandmarks used in this analysis will be defined by the large clusters ofpoints on the ASD, as they indicate major portions of the tumor thathave undergone a copy number change in a significant fraction of theoverall tumor population. See FIG. 7B for the landmark allelic statesused to analyze GBM-06-0185. For each landmark on the ASD, all plausibleclonal mixtures that would result in its observed copy number and allelefraction are considered, then the optimal clonal mixture is chosen suchthat it can account for all ASD landmarks most parsimoniously.

As observed in FIG. 5, for monoclonal tumor populations, one wouldexpect the landmarks to all lie on ASD vertices. However, in polyclonaltumors comprising two major clones, where clone B inherits all of cloneA's allelic states and has additional allelic states distinct from cloneA, one would expect to find landmarks on both the vertices and edges ofthe ASD. The landmarks that lie on vertices are those that represent theallelic states shared by both clone A & B, while the landmarks on ASDedges represent the mixture of the different allelic states. Theposition along this connecting edge determines the proportions of cloneA and B in the mixture. If multiple landmarks are found on edges and noton vertices, then the variety of positions along their respective edgeswill determine the number of clones.

For example, if all landmarks are found halfway between two allelicstates, the example is most simply explained by two major clones inequivalent proportion within the population. If, however, one landmarkis located at the halfway mark and another is found 25% along the waytowards an allelic state, there must be more than two clones in thepopulation. One simple explanation for this is that there are threeclones, A, B, & C, where A makes up 50% of the tumor population andclones B & C make up 25% each. Assuming clones B & C both exhibit asingle-copy allelic state change from clone A, explaining the halfwaylandmark. The 25% landmark is then explained if, in that chromosomalsegment, clone B (or C) experienced a single-copy allelic state changenot found in clones A and C (or B). Thus, the problem at hand isdetermining the least number of major clones that explain n observedlandmarks, which can be expressed as:L ^(obs)ϵ(L ₀ ^(obs) ,L ₁ ^(obs) , . . . ,L _(n) ^(obs))where L^(obs) _(i)=(CN^(obs) _(i),AF^(obs) _(i)). One can then assume amixture of m clones, each with k integral majority and minority allelicstates C_(i)=[(t⁰ _(maj,i), t⁰ _(min,i)), (t¹ _(maj,i), t¹ _(min,i)), .. . , (t^(k) _(maj,i), t^(k) _(min,i))], and mixture proportions, M_(i),such that ΣM_(i)=1.0−α. The relative copy number and allele fraction ofeach landmark, L^(mix) _(i), is a linear combination of the allelicstates indexed by i across the clonal mixture:

${CN}_{i}^{mix} = \frac{{2\;\alpha} + {\sum_{k}^{m}{M_{k}( {t_{{maj},i}^{k} + t_{\min,i}^{k}} )}}}{2}$${AF}_{i}^{mix} = \frac{\;{\alpha + {\sum_{k}^{m}{M_{k}( t_{{maj},i}^{k} )}}}}{{2\alpha} + {\sum_{k}^{m}{M_{k}( {t_{{maj},i}^{k} + t_{\min,i}^{k}} )}}}$where the normal allelic states for all clones are assumed to be n^(k)_(maj,i)=n^(k) _(min,i)=1. The optimal solution is the one that mostclosely approximates the observed landmarks with a simplest mixture ofmajor clones, or optimizing the objective function:

${O( {L^{obs},L^{mix}} )} = {{\frac{1}{n}\sqrt{{\sum\limits_{i}^{n}( {{CN}_{i}^{obs} - {CN}_{m}^{mix}} )^{2}} + ( {{AF}_{i}^{obs} - {AF}_{i}^{mix}} )^{2}}} + m^{x}}$which is the RMSD from the observed data plus a penalty for the numberof clones in the population, controlled by the strength parameter x.

The method is performed after finding the best fit parameters. It beginsby identifying all “shared” landmark allelic states, which every clonein the mixture must exhibit. If we assume a tumor is step-wise evolving,these shared allelic states represent the “root” of the tumorsevolutionary tree. If there are no landmarks on ASD edges, the procedureis complete and the tumor population is classified as monoclonal.

If landmarks exist along ASD connecting edges, between two boundingallelic states, then additional clones are necessary. The procedure addsone additional “daughter” clone to the mixture, which inherits all ofthe shared allelic states and gains an allelic state and mixtureproportion necessary to explain the edge-bound landmark. If more thanone edge-bound landmark can be explained with the same mixtureproportion, then those new allelic states are added to the new clone.This process is repeated until all non-vertex landmarks are explained bythe clonal mixture, wherein each additional “daughter” clone can derivefrom any current clone in the mixture that bounds one side of anunexplained landmark. Once all landmarks can be reasonably explained,the clones' allelic states and mixture proportions are reported.

It should be noted that from the equations above the combination ofallelic states that uniquely determine each landmark's position on theASD can also determine the phased set allelic states for all positionsin the genome that correspond to the landmark. This can only work whenthe mixture proportions are unique for each clone, i.e. the major clonesmust unevenly split the tumor population. In such cases, this enableswhole genome, clone-specific karyotypes to be inferred for each clone inthe tumor population. Consequently, using an allelic landmark willprovide a technical advantage in that it is now possible to define thenumber of distinct clones and their proportions within the tumorpopulation.

Linking Mutations to Clone-Specific Allelic States

To achieve an even greater understanding of the evolution of a tumor,one need not be constrained to exclusive analysis of copy numberchanges. By integrating somatic mutations into the above discussedframework, it is now possible to determine when a mutation arose duringthe tumor's development. To do this, one or more mutations will bedirectly linked to the majority or minority allele in the encompassingchromosomal region on the ASD. Then the mutation's allele fraction isused to determine whether the mutation occurred prior to the change inallelic state, soon after the allelic state change, or much later. Suchanalysis can be performed in two different manners.

Via direct phasing: For every mutation discovered by sequence analysis,all nearby germline heterozygous variants can be identified to identifypaired reads that physically connect, or “phase,” the mutation allele toa specific germline allele. “Nearby” is defined in this context as beingseparated by no more than double the insert size of the paired readlibrary, typically 1,000 bp for these whole genome libraries, as that iswell outside the expected distance that would separate two paired reads.

All read pairs that overlap the positions of both the mutation and thegermline variant are collected and the number of times the mutation isphased to either germline variant alleles is recorded. If the mutationis found linked to the same germline variant allele more than once, andis not found also phased to the other allele of that germline variant,it is considered to be directly phased to that germline variant allele.Phasing can be made either within a single read if the mutation andgermline variant are separated by less than a read's length, or canoccur across mates of a read pair. Mutations can also be phased tomultiple germline variant positions.

For every mutation that can be directly phased to a germline variant,the germline variant's allele fraction is used to determine if themutation is phased to the majority or minority allele. If the germlinevariant's allele fraction is determined to be greater than or equal to0.5, then the mutation is deemed “majority-phased,” otherwise it wasphased to the minority allele, or “minority-phased.” Note that in thecases of when the two allelic states are equal, such as normal (1, 1) orbi-allelic, balanced amplifications (2, 2), the mutation's assignment to“majority” or “minority” allele depends on whichever allele was sampledslightly deeper in the sequencing data. Thus, classifying mutations as“majority-phased” or “minority-phased” in such cases is not meaningful.

Via amplified allele fraction: When direct phasing cannot be made, theability to determine which allele the mutation is linked to is severelylimited. However, when mutations are found within an amplifiedchromosomal segment, one can use the mutation's allele fraction todetermine to which allele the mutation may be linked. When themutation's allele fraction is approximately equal to the majority allelefraction, this can only have occurred if the mutation was present in theamplified allele prior to the amplification. If the mutation was insteadon the un-amplified allele, the mutation's allele fraction wouldnecessarily be much lower.

However, low mutation allele fractions do not necessarily indicate thatthey are not “majority-phased,” since mutations can occurpost-amplification. For example, if a region was amplified by a singlecopy, allelic state (2, 1), a post-amplification mutation could be, atmost, present on one copy of the majority allele, with a maximal allelefraction of 1/2+1=0.33, compared to the expected allele fraction of apre-amplification mutation, 2/2+1=0.67.

Thus, one is limited to linking un-phased mutations to amplifiedsegments when the mutations occur prior to the amplification.Nevertheless, this can be still be useful, as one would expect oncogenicmutations to occur early in the tumor's development, as they are likelyto drive tumor growth. If multiple copies of these oncogenic mutationsare selectively advantageous for the tumor cell, then one would expectthe requisite increase in mutation copy number and allele fraction toenable the user to employ this method.

Comparing Allele Fractions to Infer Mutation Timing

After assigning mutations to the majority or minority alleles, one canthen compare the allele fraction of the mutation to the allele fractionof the majority or minority allele fraction of the chromosomal segmentthat encompasses the germline variant allele. It is generally preferredto use the allele fraction of the chromosomal segment instead of thegermline variant allele because the estimate of the chromosomalsegment's allele fraction is more accurate due to averaging over allgermline heterozygous positions within the segment. To accuratelycompare a mutation's allele fraction to the majority or minority allelefraction of heterozygous positions, one must add in some “normal”contamination to the mutated allele. Note that majority allele fraction,AF, features normal contamination in both its numerator and denominator.This is due to the fact that the positions considered in these equationsare heterozygous in the normal, and thus one expects to get normalcontamination from both alleles. However, for a somatic mutation, thereis no normal contamination of the mutant allele, as the mutation doesnot exist in the normal:

${MAF} = \frac{( {1 - \alpha} ){mt}_{maj}}{{( {1 - \alpha} )( {{mt}_{maj} + {( {1 - m} )t_{maj}} + t_{\min}} )} + {\alpha( {n_{maj} + n_{\min}} )}}$where MAF is mutant allele fraction, m is the fraction of copies of thetumor allele t_(maj) that are mutated, and t_(maj), t_(min), n_(maj),and n_(min) represent the same homozygous allele. To fairly compare MAFto allele fractions estimated at heterozygous positions, the followingcorrection is employed:

$\begin{matrix}{{MAF}_{c} = {{MAF} + \frac{\alpha\; n_{maj}}{{( {1 - \alpha} )( {t_{maj} + t_{\min}} )} + {\alpha( {n_{maj} + n_{\min}} )}}}} \\{= \frac{{( {1 - \alpha} ){mt}_{maj}} + {\alpha\; n_{maj}}}{{( {1 - \alpha} )( {t_{maj} + t_{\min}} )} + {\alpha( {n_{maj} + n_{\min}} )}}}\end{matrix}$where MAFc is the corrected mutant allele fraction. Note that while m isallowed to be any fraction less than or equal to zero in the aboveequations, there are some values of m that have special meaning. If m=1,then all of the t_(maj) alleles were mutated, and in the cases wheret_(maj) represents an amplified allele, when m=1 the mutation must haveoccurred prior to the amplification. When m=1 t_(maj), where t_(maj)represents the number of copies of the amplified allele, then one knowsthat the mutation must have occurred soon after the amplification, sinceit exists on a single copy of the amplified allele but is found in thisstate in the majority of tumor cells. If, however, m<<1/t_(maj), thenthe mutation must have occurred after the amplification, likely verylate during the tumor's growth, as its very low allele fractionindicates it's only found in a small fraction of tumor cells.

If the mutation is phased to the minority allele, t_(min), one expectsto find the maximum mutation fraction to be m=t_(min)/t_(maj) as thatindicates that all copies of the minority allele were converted. So,when the minority allele state exists in single copy and all copies ofit were mutated, m=1 t_(maj), precisely the same mutant fraction onewould calculate for a “majority-phased” mutation present at single copy.Thus, only with the direct phasing one can distinguish between an early“minority-phased” mutation and mutations that occur post-amplification.

Examples

GBM (Glioblastoma Multiforme):

12 whole genome GBM samples were processed with the above methods todetermine the level of normal contamination and the clonality present ineach tumor biopsy. The relative coverage and allele fraction produced byBamBam for the other 5 whole genome GBM samples discussed in previoussections possessed too much variability to be analyzed by these methods.The results of the clonality analysis are summarized in Table 1.

Sample Clonality Normal Cont. (α) # Major Clones GBM-06-0145 monoclonal22.5% 1 GBM-06-0155 monoclonal 24.5% 1 GBM-06-0877 monoclonal 29.8% 1GBM-06-0648 polyclonal 12.5% 2 GBM-06-0152 polyclonal 24.1% 3GBM-06-1086 polyclonal 7.5% 4 GBM-06-0185 polyclonal 14.6% 4 GBM-14-1454polyclonal 5.9% >1 GBM-14-0786 polyclonal 13.9% >1 GBM-06-0188polyclonal 20.6% >1 GBM-06-0214 polyclonal 33.6% >1 GBM-26-1438polyclonal 50.0% >1

Surprisingly, only 3 GBM tumor samples were found to be monoclonal,while the other 9 samples included at least two major clones. For 7 GBMtumors, the precise mixture of clones was determined, while theremaining 5 tumor were inspected visually to determine their clonality.

Results of two tumors, GBM-06-0145 and GBM-06-0185, are shown in FIGS.7A and 7B. The relative coverage and allele fraction data of these twosamples were transformed using the best fit parameters as describedabove, demonstrating close fit to the ASD with estimated normalcontamination levels of 21.5% and 14.6%, respectively. By inspecting thelocation of the data cluster, whether on vertices or edges, one canvisually determine the clonality of these tumors. Since all ofGBM-06-0145's (FIG. 7A) data cluster around ASD vertices, it is likelythis tumor is monoclonal. On the other hand, GBM-06-0185 (FIG. 7B) isclearly polyclonal, since the several large clusters along the ASD edgesindicate the presence of at least two major clones in this tumor. Infact, since the edge-bound clusters are found at different positionsalong their edges (e.g. some clusters are at the halfway mark, whileother clusters are approximately 0.75 and 0.80 of the way towards thesingle-copy deletion state, respectively), this can only occur from amixture of at least three major clones.

To determine precisely the number of clones in these samples, theinventors used the methods described above to determine the number ofclones and their allelic states. For each inferred clonal mixture, theinventors computationally determined relative copy number and allelefraction for every position in the genome given the derived clonalmixture and compared it against the results produced by the sequenceanalysis. This provides a metric to determine how well the clonalmixture models the observed data.

As shown in FIG. 8, the inventors found a single clone for GBM-06-0145,as expected. The computationally-derived relative copy number and allelefraction data shows a very good fit to the observed data. A total offour major clones were found for GBM-06-0185, whose clonal allelic stateis presented in FIG. 9. There are two important things to note from thefour clones presented here. Firstly, as described before, the fact thateach clone's mixture proportion is different from all others helps tophase the allelic states across the whole genome into clone-specifickaryotypes. Secondly, all clones appear to have derived from clone A.Each derivative clone shares all of the events found in clone A,suggesting that clone A is the progenitor of clones B, C, and D. It isunclear, however, if this set of clones evolved linearly, in a stepwiseprogression, or if clone B and clones C & D represent differentlineages. These latter three clones differ by the deletions in chr6q,where clone B features a set of focal deletions while clones C & D havelost all of chr6q. These are not mutually exclusive events, so it ispossible that clone C had derived from B, inheriting its focal deletionsand subsequently deleting the remainder of chr6q. However, nothingprecludes clones B and C from deriving directly from clone A andindependently deleting parts of chr6q. It is interesting that it isclone D, the last clone of the tree, that becomes the dominant clone inthe tumor population according to mixture proportion, suggesting thatthe events unique to this clone (e.g. amplification of chr9) may haveprovided a growth advantage to this clone.

The clonal karyotypes for GBM-06-0152 is shown in FIG. 10. This tumor isinteresting because the amplification of chr7, a characteristic ofapproximately 40% of GBM tumors, does not occur until clone B. It shouldbe noted that this sample was also shown in an independent analysis tohave two double minute chromosomes, one with MDM2 & CDK4 and anothercontaining EGFR, that were borne out of a chromothripsis-like event.While extremely amplified genomic regions are difficult to model inthese clonal karyotypes, evidence of the deletions related to theseevents on chr12 in clone A can be seen, suggesting that these doubleminutes occurred early in the tumor development. It is possible that theearly focal amplification of EGFR may have played a role in the lateremergence of chr7 amplification.

The clonal evolution presented by the karyotypes for sample GBM-06-1086has a few interesting aspects that are worth describing here. The firstsubtle thing to notice in its karyotype, shown in FIG. 11, is that thefocal deletion of CDKN2A occurs does not occur until clone B, suggestingit occurred after the complete loss of chr9 observed first in clone A.This is strong evidence supporting the hypothesis that focal losses ofCDKN2A likely occur after arm-level or entire chromosomal losses ofchr9. The second interesting aspect is that clones C and D have lossesof 13 different entire chromosomes. Clone D takes this one step furtherby deleting its last copy of chr18, as well as amplifying chr19. Thisreduces the ploidy of both clones C and D to 1.31, from theapproximately normal ploidy shared by the other two clones(ploidy=1.95). It is remarkable how cells that have lost almost 30% oftheir genomic content can not only survive but, given the 41.8% mixtureproportion of clone D, apparently thrive in a population of tumor cells.

Lung Squamous Cell Carcinoma (LUSC):

Whole genome data for nine squamous cell carcinomas of the lung (LUSC)sequenced by TCGA were analyzed by these methods to infer clonality. Theallelic state diagrams of two tumors are shown in FIGS. 7C and 7D. Fromthe greater number of transitional allelic states evident in these twosamples, it appears that these LUSC tumors exhibit a much higher degreeof clonality compared to the GBM tumors described above.

The tumor sample LUSC-66-2756 shown in FIG. 7D exhibits numerous highlyamplified states at ASD vertices (states common among all major clones)and ASD edges (states shared by only a subset of major clones). A widevariety of mixture proportions is evident from the almost continuous setof different positions of point clusters along, and in between, ASDedges, suggesting that this sample is highly polyclonal. Anotherinteresting feature of this sample is that none of its genome is foundin the single copy loss allelic state (1,0). This may have occurred viaa genome doubling event where the tumor genome was briefly tetraploid(N=4), then a series of chromosomal deletions led to either single copygain, “normal,” or CN-LOH allelic states. Genome doubling events arebelieved to often occur in serous ovarian carcinomas to explain howlarge portions of their genomes are observed in the CN-LOH allelicstate.

Phased Mutations to Allelic States:

To visualize the phased mutations onto allelic states, the inventorsused a slightly modified allelic state diagram, the dual allelic statediagram (dual ASD). Noting from the equations above that since minorityallele fraction is the complement of majority allele fraction(AF_(min)=1.0−AF_(maj)), one can construct a dual ASD by placing a minorimage of the ASD to display the location of the minority allelic states.Mutations phased to germline variants corresponding to the majorityallele, minority allele, or neither, are plotted on the dual ASD. Bydetermining which allelic state (majority or minority) the mutations arenearest and using their phased status (if any), one can infer the timingof the mutations.

An exemplary dual ASD is shown in FIG. 12, which presents a series ofmutations that are phased to germline variants belonging to either themajority or minority alleles in two different allelic states. Eachmutation's allele fraction is corrected as noted above and placed ontothe dual ASD. Based on their phase and mutant allele fraction, the dualASD assists in the identification on how many copies of the majority (orminority) alleles the mutation is present. In the case of theamplification presented in FIG. 12, mutations present on both allelesversus those on just one of the amplified copies is readilydistinguished, allowing visual determination whether a mutation occurredbefore or after the amplification. Similarly, for mutations phased tothe minority allele, one would see them having MAFc≤0.5 for all but the“normal” allelic state where their phased assignment is not meaningful.

The dual ASDs for tumor GBM-06-0145 are shown in FIG. 13. 6 regions onthese diagrams are highlighted to help with interpretation of thesediagrams on real data. Regions (a) and (b) show mutations that weredirectly phased via nearby germline variants to the majority allele, butonly 2 mutations in (b) are found to have the MAFc corresponding to anamplified mutation. Most majority-phased mutations are found in region(a), corresponding to mutations at single copy number, discovering thatthese mutations occurred post-amplification. An unphased, missensemutation in DOCK8 is found in the single copy loss allelic state,meaning that the only copy of DOCK8 remaining in this tumor is inmutated state. Inactivation of DOCK8 through homozygous deletion hasbeen linked to progression of lung cancers, so the lack of wildtypeDOCK8 in this GBM tumor may have played a role in its tumorigenesis.FIG. 13 also demonstrates the high degree of variation in estimates ofMAFc from these average coverage whole genomes.

The most striking thing about the dual ASD for tumor LUSC-34-2596, shownin FIG. 14, is the sheer number of mutations, phased or unphased, acrossall expected allelic states. Compared to the previous GBM tumor, it isclear that the mutation rate of LUSC-34-2596 is significantly higher.This is expected since lung tumors exhibit some of the highest mutationrates among the cancers studied thus far by TCGA.

The inventors observed a great number of both majority- andminority-phased mutations in the balanced-amplified allelic state (2,2)at the expected MAFc≈0.5, labeled (a) and (b) in FIG. 14. The inventorsalso observed a cluster of mutations to the left of these regions thatcorresponding to mutations at single copy number. The location of amajority-phased missense mutation in NDRG1, a gene recently discoveredto be up-regulated in squamous cell lung cancer, is found in a genomicregion in between the “normal” and single copy loss allelic states. ItsMAFc is approximately equal to the allele fraction of the genomicregion, suggesting that the mutation exists on both clones (i.e. theclone with “normal” allelic state and the clone with single copy lossallelic state). This is evidence that the mutation occurred early, priorto the emergence of the second clone featuring the new deletion, andthat the deletion contained the wildtype version of NDRG1.

The location of three unphased mutations, BRAF, DNMT3A, and TP53, arealso highlighted in FIG. 14. The nonsense mutation in TP53 is found in aCNLOH state and has a mutant allele fraction that precisely correspondsto the CN-LOH allele fraction, meaning that this tumor has deleted onecopy of TP53, knocked out the remaining copy via mutation, and thenamplified the mutant allele. The region encompassing BRAF was highlyamplified, and it is clear from BRAF's MAFc that the mutation occurredprior or early in its amplification. BRAF mutations occur frequently inmelanomas but have been recently discovered in a small percentage ofnon-small cell lung carcinomas. Since more than half of the copies aremutated, the mutation could not have occurred after the amplificationprocess had finished unless BRAF was independently and identicallymutated on multiple copies, a highly improbable event. DNMT3A, a genewhose loss is implicated in lung cancer and other tumor types, is foundin the “normal” allelic state and has the expected MAFc 0.5. In all ofthese cases, mutations to these genes must have occurred early duringtumorigenesis as they are present in all (or, in the case of BRAF, atleast the majority) of the tumor's major clones. Coupled with the factthat these are genes known to be implicated in multiple tumor typesraise the possibility that one or more of these mutations are drivers ofthis particular tumor.

Table 2 below summarizes the phaseable mutations for 12 GBM and 8 LUSCtumors. Again, the higher overall rate of mutations in the LUSC tumorsrelative to the GBM tumors should be noted. Also, it is clear thatsignificantly more mutations are found at single copy within theamplified regions of the GBM tumors, whereas one can find mutationsuniformly distributed across the amplified allelic states in LUSCtumors.

Maj- Amp. Single Min- Sample Total Phased State Copy Phased GBM-06-01881,595 67 2 1 (50%) 19 GBM-06-0648 1,600 68 9 5 (56%) 34 GBM-06-08774,792 367 130 84 (65%) 145 GBM-14-1454 2,539 291 55 38 (69%) 80GBM-06-0152 2,896 269 47 35 (74%) 80 GBM-26-1438 2,820 239 36 32 (89%)86 GBM-06-0155 7,167 1,108 99 93 (94%) 440 GBM-06-0214 2,827 237 31 29(94%) 83 GBM-06-0145 3,800 511 56 53 (95%) 225 GBM-14-0786 5,863 783 117114 (97%) 241 GBM-06-1086 3,993 412 10 10 (100%) 129 GBM-06-0185 1,19520 2 2 (100%) 14 LUSC-66-2756 45,995 7,336 6,869 2,107 (31%) 2,104LUSC-56-1622 19,343 3,159 286 99 (35%) 1,016 LUSC-60-2695 6,265 830 408148 (36%) 302 LUSC-43-3394 18,905 4,162 694 267 (38%) 1,184 LUSC-66-275711,400 1,477 374 177 (47%) 498 LUSC-60-2713 21,337 4,290 503 262 (52%)1,207 LUSC-34-2596 30,794 7,166 663 349 (53%) 1,410 LUSC-60-2722 32,4855,101 504 319 (63%) 1,682 “Total” = # phased and unphased mutationscalled, “Maj-Phased” = # majority-phased mutations, “Amp. State” = #majority-phased mutations in regions of amplified allelic state, “SingleCopy” = # majority-phased mutations in regions of amplified allelicstate, but has MAF_(c) corresponding to single copy, “Min-Phased” = #minority-phased mutations.

Assuming the mutation rate remained constant throughout the developmentof these tumors, then the amplifications occurred early in thedevelopment of the GBM tumors, before most mutations occurred. Usingthis same reasoning, mutations and copy number alterations are frequentoccurrences of LUSC tumor development, with large numbers of mutationsoccurring prior to and after amplification events.

Another possibility explaining the difference in mutation patters isthat the mutation rate did not remain constant during development.Suppose that amplification of the growth factor EGFR, a common event inthese GBM tumors, increases the cell's rate of growth and subsequentlyreduces the cell's ability to correct mistakes made during genomereplication, thereby increasing the mutation rate per cellular division.This could explain the enrichment of mutations present in single copywithin amplified allelic states. However, without knowledge of thenumber of generations that occurred before and after EGFR amplification,one cannot determine if the mutation rate increased. Nevertheless, itshould be appreciated that using the ASD and dual ASD methods presentedherein, significant and clinically relevant information can be drawnfrom sequence analysis output that in an unprecedented manner.

It should be apparent to those skilled in the art that many moremodifications besides those already described are possible withoutdeparting from the inventive concepts herein. The inventive subjectmatter, therefore, is not to be restricted except in the spirit of theappended claims. Moreover, in interpreting both the specification andthe claims, all terms should be interpreted in the broadest possiblemanner consistent with the context. In particular, the terms “comprises”and “comprising” should be interpreted as referring to elements,components, or steps in a non-exclusive manner, indicating that thereferenced elements, components, or steps may be present, or utilized,or combined with other elements, components, or steps that are notexpressly referenced. Where the specification claims refers to at leastone of something selected from the group consisting of A, B, C . . . andN, the text should be interpreted as requiring only one element from thegroup, not A plus N, or B plus N, etc.

What is claimed is:
 1. A computer implemented method of ex-vivodetermining clonality of a tumor using whole genome sequencing dataobtained from the tumor, comprising: storing in a non-transitory,computer readable memory of a BAM server, whole genome sequence datafrom a first and second tissue sample of a patient; wherein the firsttissue sample is a tissue sample from the tumor, and the second tissuesample is a matched normal tissue sample; generating, by the BAM server,a plurality of genetic sequence strings from the whole genomic sequencedata of the first and second tissue sample incrementally synchronizing,by the BAM server, the genetic sequence strings using one or more knownpositions of at least one of corresponding genetic sequence strings toso produce a plurality of local alignments; analyzing, by the BAMserver, the so produced plurality of local alignments to generate aplurality of local differential strings between the first and secondsequence strings within the local alignments; generating, by the BAMserver, a plurality of differential sequence objects for the genome fromthe plurality of local differential strings, wherein differentialsequence objects comprise a copy number and an allele fraction for anallele within the tumor genome; calculating an allelic state for theallele based on the determined copy number and the determined allelefraction; using the allelic state to determine clonality; wherein thecalculating the allelic state comprises at least one of a correction fornormal contamination and an identification of a mixture fraction Mb foran allele, wherein the tumor is polyclonal when M_(b) for the allele isgreater than 0 and smaller than 1; identifying a clone of an initialtumor mass or a clone that is set to become a dominant clone using theclonality.
 2. The method of claim 1, wherein the allelic state isidentified as being a state selected from the group consisting of: anormal copy number, a single copy amplification, a singlecopy/hemizygous deletion, loss of heterozygosity followed by one or moreamplifications of the remaining allele, and an amplification of bothalleles.
 3. The method of claim 1, wherein the step of calculating theallelic state comprises the correction for normal contamination.
 4. Themethod of claim 1, wherein the step of calculating the allelic stateuses majority and minority allelic states for tumor and normal.
 5. Themethod of claim 1, wherein the tumor is monoclonal when M_(b) for theallele is either 0 or
 1. 6. The method of claim 1, wherein the step ofcalculating the allelic state comprises a correction for sequencingcoverage level.
 7. The method of claim 1, further comprising a step ofdetermining an allelic state landmark.
 8. The method of claim 7, furthercomprising a step of using the allelic state landmark to determine atleast one of a number of clones in the tumor and a proportion of clonesin the tumor.
 9. The method of claim 1, further comprising a step oflinking a mutation to a majority allele or a minority allele, and usingthe mutational allele fraction for determination of timing of themutation relative to a change in allelic state.
 10. The method of claim1, further comprising a step of plotting the allelic state in an allelicstate diagram.
 11. The method of claim 1, further comprising a step ofplotting the allelic state in a dual allelic state diagram.
 12. Themethod of claim 1, further comprising: identifying a clone as theprogenitor clone of other clones of the tumor.
 13. A computerimplemented method for providing treatment information for treatment ofa tumor, comprising: storing in a non-transitory, computer readablememory of a BAM server, whole genome sequence data from a first andsecond tissue sample of a patient; wherein the first tissue sample is atissue sample from the tumor, and the second tissue sample is a matchednormal tissue sample; generating, by the BAM server, a plurality ofgenetic sequence strings from the whole genomic sequence data of thefirst and second tissue sample incrementally synchronizing, by the BAMserver, the genetic sequence strings using one or more known positionsof at least one of corresponding genetic sequence strings to so producea plurality of local alignments; analyzing, by the BAM server, the soproduced plurality of local alignments to generate a plurality of localdifferential strings between the first and second sequence stringswithin the local alignments; generating, by the BAM server, a pluralityof differential sequence objects for the genome from the plurality oflocal differential strings, wherein differential sequence objectscomprise a copy number and an allele fraction for an allele within thetumor genome; ascertaining allelic state for the tumor, wherein theascertaining the allelic state information comprises generating anallele state diagram and determining location of the allelic state forthe allele in the allele state diagram; identifying presence oremergence of (a) a clone or (b) an evolutionary pattern of clones withinthe tumor that is indicative of at least one of susceptibility of thetumor to treatment with a drug and an increased risk of drug resistance.14. The method of claim 13, wherein the step of identifying presence oremergence is based on prior treatment data or a priori known data. 15.The method of claim 13, further comprising: identifying, using theallele state diagram, at least one clone as the progenitor clone ofother clones of the tumor.
 16. The method of claim 13, wherein thetreatment targets the alteration specific to the clone that is mostupstream in the evolutionary pattern.